User-generated video content has grown rapidly, to the point of outpacing professional content creation. In this work we develop methods that analyze contextual information from multiple user-generated videos in order to obtain semantic information about the public happenings (e.g., sport and live music events) being recorded. A key contribution of this work is the joint utilization of different data modalities, including those captured by auxiliary sensors during each user's video recording. In particular, we analyze GPS data, magnetometer data, accelerometer data, and video- and audio-content data. We use these modalities to infer information about the recorded event in terms of its layout (e.g., stadium), genre, indoor versus outdoor scene, and main area of interest. Furthermore, we propose a method that automatically identifies the optimal set of cameras to use in a multicamera video production. Finally, we detect camera users who fall within the field of view of other cameras recording the same public happening. We show that the proposed multimodal analysis methods perform well on various recordings obtained at real sport events and live music performances.
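One building block hinted at above is locating the event's main area of interest from sensor data. A minimal sketch of one plausible approach (not necessarily the paper's exact method): treat each user's GPS position and magnetometer-derived compass heading as a viewing ray, and find the point that is closest in a least-squares sense to all rays. The function name and the east-north local coordinate frame are illustrative assumptions.

```python
import numpy as np

def estimate_area_of_interest(positions, headings_deg):
    """Least-squares intersection of camera viewing rays.

    positions: (N, 2) camera locations in a local metric (east, north) frame
    headings_deg: (N,) compass headings in degrees (0 = north, clockwise)
    Returns the 2-D point minimizing the summed squared distance to all rays.
    """
    positions = np.asarray(positions, dtype=float)
    theta = np.radians(np.asarray(headings_deg, dtype=float))
    # Compass heading -> unit direction vector in the (east, north) frame.
    dirs = np.stack([np.sin(theta), np.cos(theta)], axis=1)
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, d in zip(positions, dirs):
        # Projector onto the subspace orthogonal to the viewing direction;
        # accumulating these yields the normal equations of the LS problem.
        P = np.eye(2) - np.outer(d, d)
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

# Example: three cameras 50 m from the origin, all pointing at it.
pts = np.array([[0.0, -50.0], [50.0, 0.0], [-50.0, 0.0]])
hdgs = [0.0, 270.0, 90.0]  # facing north, west, east
print(estimate_area_of_interest(pts, hdgs))  # -> [0. 0.]
```

In practice the GPS fixes and compass headings are noisy, so a robust variant (e.g., discarding rays whose residual is large) would be preferable to the plain least-squares solve shown here.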